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Abstract 

We  study  pooled  (or  group)  testing  as  a  method  for  estimating  the  prevalence  of  HIV; 
rather  than  testing  each  sample  individually,  this  method  combines  various  samples  into  a 
pool  and  then  tests  the  pool.  Existing  pooled  testing  procedures  estimate  the  prevalence 
using  dichotomous  test  outcomes.  However,  HIV  test  outcomes  are  inherently  continuous, 
and  their  dichotomization  may  ehminate  useful  information.  To  overcome  this  problem, 
we  develop  a  parametric  procedure  that  utilizes  the  continuous  outcomes.  This  procedure 
employs  a  hierarchical  pooUng  model  and  estimates  the  prevalence  using  the  likelihood  equa- 
tion. The  likelihood  equation  is  solved  using  an  iterative  algorithm,  and  a  simulation  study 
shows  that  our  procedure  yields  very  accurate  estimates  for  a  fraction  of  the  cost  of  existing 
procedures. 
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1     Introduction 

Estimation  of  the  prevalence  of  HIV  is  important  for  evaluating  the  current  status  of  the 
AIDS  epidemic  and  planning  effective  intervention  programs.  However,  precise  estimates  are 
very  difficult  to  obtain  because  HIV  infection  is  not  associated  with  any  clinical  symptoms. 
Because  the  interval  between  the  infection  time  and  the  onset  of  the  clinical  symptoms  of 
AIDS  is  very  long,  the  total  number  of  AIDS  cases  does  not  provide  information  on  the 
recent  incidence  of  HIV.  Consequently,  population  surveys  are  the  primary  mode  of  HIV 
prevalence  estimation. 

This  method  screens  an  unbiased  population  sample  using  an  antibody  test,  such  as 
ELISA,  and  estimates  the  prevalence  from  the  number  of  positive  tests.  However,  even  this 
method  has  its  limitations.  Because  participants  in  a  siurvey  must  give  their  prior  consent, 
self-selection  bias  can  occur  that  results  in  an  underestimation  of  the  prevalence  (Gill,  Adler 
and  Day  1988).  Also,  the  number  of  tests  is  frequently  limited  by  financial  constraints  that 
adversely  affect  the  accuracy  of  the  estimates. 

One  possible  way  to  improve  the  accuracy  of  the  estimates  and  reduce  the  self-selection 
bias  is  pooled  testing.  The  rationale  behind  pooled  testing  is  simple  and  intuitive:  suppose 
that  a  pool  is  formed  from  the  blood  of  ten  (for  example)  individuals  and  is  tested  with  a 
single  antibody  test.  If  the  prevalence  is  sufficiently  small  then  there  is  a  high  probability 
that  all  ten  samples  will  be  HIV  negative.  Moreover,  even  if  the  pool  is  positive,  it  will 
raxely  contain  more  than  one  infected  individual.  Under  these  conditions,  a  single  pooled 
test  provides  nearly  the  same  information  as  ten  individual  tests,  and  the  pooling  method 
achieves  almost  the  same  accuracy  as  individual  testing  but  for  significantly  lower  cost. 
Furthermore,  the  pooling  procedure  preserves  the  anonymity  of  the  participating  individuals, 
and  therefore  can  reduce  the  self-selection  bias  (Gastwirth  and  Hammich  1989). 

In  his  seminal  paper,  Dorfman  (1943)  showed  how  pooled  testing,  which  is  called  group 


testing  in  the  statistical  literature  and  composite  testing  in  the  environmental  statistics 
literature,  can  be  employed  to  efficiently  detect  the  defective  members  of  a  population. 
Group  testing  was  researched  aggressively  in  the  subsequent  years  and  a  large  literature  now 
exists  on  the  topic  (see  Sobel  and  Groll  1958  and  Johnson,  Kotz  and  Wu  1991  and  references 
therein). 

Group  testing  has  also  been  used  to  estimate  the  proportion  of  the  defective  members 
of  a  population  (e.g.,  Sobel  and  Elassoff  1975,  Chen  and  Swallow  1989  and  Lovison,  Gore 
and  Patil  1994).  However,  these  studies  assume  that  tests  have  no  misclassification  errors. 
Motivated  by  the  HIV  prevalence  estimation  problem,  Tu,  Litvak  and  Pagano  (1994)  and 
Gastwirth  and  Hammich  provide  new  insights  into  pooled  testing  by  developing  poohng 
procedures  for  imperfect  tests.  Nevertheless,  existing  studies  suffer  from  two  shortcomings. 
First,  they  assume  that  pooling  does  not  affect  the  sensitivity  and  specificity  of  the  test. 
However,  if  the  group  size  is  large,  some  positive  sera  may  be  excessively  diluted  by  negative 
sera  and  become  undetectable  in  the  pool.  Failure  to  explicitly  capture  this  dilution  effect 
may  result  in  prevalence  underestimation.  Second,  they  assume  that  the  test  results  are 
binary  (i.e.,  infected  or  non- infected),  even  though  the  outcomes  of  ELISA,  known  as  the 
optical  density  readings,  are  inherently  continuous.  To  implement  their  poohng  procedures, 
they  adopt  the  dichotomous  classification  of  the  test  outcomes  dictated  by  the  test  kit 
manufacturers;  this  classification,  in  addition  to  ehminating  useful  information,  is  specifically 
designed  for  individual  testing. 

The  purpose  of  this  paper  is  to  propose  a  parametric  procedure  that  overcomes  these 
two  shortcomings.  Our  procedure  employs  the  hierarcfiical  pooling  model  of  Wein  and 
Zenios  (1995),  which  is  described  in  Section  2,  and  explicitly  captures  both  the  dilution 
effect  and  the  continuous  nature  of  the  test  outcomes.  This  model  is  used  in  Section  3 
to  develop  an  iterative  estimation  algorithm  that  utilizes  the  continuous  optical  density 


readings.  To  implement  the  algorithm,  it  is  necessary  to  calculate  a  quantity  that  cannot  be 
expressed  in  closed  form.  We  overcome  this  problem  by  developing  an  asymptotic  expansion 
that  is  described  in  Section  4.  An  upper  bound  on  the  asymptotic  variance  is  derived  in 
Section  5,  which  is  used  in  Section  6  to  determine  the  sample  size  and  the  pool  size  that 
eflPectively  balance  the  tradeoff  between  the  testing  cost  and  the  accuracy  of  the  estimate. 
A  Monte  Caxlo  simulation  study  is  carried  out  in  Section  7  to  assess  the  performance  of  our 
estimation  procedure.  The  results  indicate  that  the  parametric  procedure  can  achieve  more 
accurate  estimates  and  atteiin  significant  cost  savings  relative  to  existing  poohng  procedures. 
Concluding  remarks  appear  in  Section  8.  '    ' 

2     The  Hierarchical  PooHng  Model 

ELISA  detects  HIV  antibodies  in  the  blood  of  infected  individuals  and  then  translates  the 
antibody  concentration  into  a  continuous  quantity,  known  as  the  optical  density  (OD)  read- 
ing. To  capture  the  continuity  of  the  OD  readings,  we  describe  this  mechanism  as  a  two-level 
hierarchical  statistical  model.  The  probabihty  density  of  the  antibody  concentration  is  spec- 
ified at  the  first  level,  and  the  conditional  density  of  the  OD  reading  given  the  antibody 
concentration  is  determined  at  the  second  level.  The  dilution  effect  is  captured  by  assuming 
that  the  antibody  concentration  in  a  pool  equals  the  average  of  the  individual  concentrations. 

Readers  are  referred  to  our  companion  paper  (Wein  and  Zenios)  for  a  detailed  de- 
scription of  the  hierarchical  model.  In  that  paper,  the  hierarchical  model  was  validated  on 
dilution  series  data  and  pooled  testing  data  using  a  generalized  linear  model.  In  this  section, 
we  briefly  summarize  our  hierarchical  model  and  show  how  to  use  it  to  derive  the  probability 
density  of  pooled  OD  readings. 

Let  p  denote  the  unknown  HIV  prevalence  and  consider  a  pool  consisting  of  m  individ- 
ual blood  samples.  The  pool's  OD  reading  and  antibody  concentration  are  defined  by  X^"^'' 


and  y^"*',  respectively.  These  random  variables  have  probability  densities  fx{x;p,4>,^) 
and  TTy  (y;p),  where  0  and  7  are  nuisance  parameters  that  appear  in  the  second  level  of  the 
hierarchical  model.  Without  loss  of  generahty,  we  assume  that  the  OD  reading  is  normalized 
so  as  to  fall  between  0  and  1. 

The  model's  first  level  specifies  the  density  ■Ky  iV'^P)-  Let  'K+iy)  and  7r_(j/)  be  the 
probability  density  for  the  antibody  concentration  of  infected  and  non-infected  individuals, 
respectively,  and  define  tt*^*^'"'*  =  tt*^  *  ■k_''\  where  *  is  the  convolution  operator.  Let 
T^Y  {y)  be  the  conditional  density  of  V*'"^  given  that  the  pool  consists  of  k  infected  and 
m  —  k  non-infected  individuals.  Then 

n^y'^'^Hy)  =  m7r*^''"^Hmy)  (1) 

and  the  first  level  of  the  model  states  that 

vrr(y;p)  =  t  [fjA^-prM^'^'Hy)-  (2) 

The  model's  second  level  specifies  the  conditional  density  of  the  OD  reading  given  the 
antibody  concentration: 

f{x-An\y)  =    ,     '  ,.    e  ^^^l^  ^i  ,  (3) 

where  the  nuisance  parameters  (j)  and  7  depend  upon  the  test  kit  employed.  By  equations  (2) 
and  (3),  the  density  of  X^"")  is 

fP{x-p,<f>n)  =  t  (7)P'(1  -Pr"'/?'"''(x;</.,7),  (4) 

where 

/j?''"\x;0,7)  =  r  T^'f'"'\y)f{x;<t>n\y)dy.  (5) 

To  complete  the  description  of  the  model,  it  remains  to  determine  the  parameters  0 
and  7,  and  the  densities  7r_  and  7r+.    Because  our  goal  is  to  estimate  p,  we  assume  that 
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these  quantities  can  be  obtained  from  an  off-line  analysis  of  historical  data,  and  that  the 
only  unknown  parameter  is  p.  The  off-line  estimation  of  these  quantities  is  discussed  in 
Subsection  7.2. 

It  is  worth  noting  that  our  model  not  only  captures  the  dilution  effect  and  the  continuity 
of  the  OD  readings,  but  is  also  consistent  with  empirical  evidence  that  identify  two  sources 
of  variability  in  ELISA:  within  sample  (due  to  measurement  errors)  and  between  sample. 
The  first  level  of  the  model  captinres  the  between  sample  variability.  The  second  level  of  the 
model,  in  particular  equation  (3),  allows  repeated  measurements  from  the  same  sample  to 
be  different,  and  hence  captmes  the  within  sample  variability. 

To  improve  readability,  in  the  remainder  of  this  paper  we  suppress  the  notational 
dependence  on  the  parameters  (f)  and  7;  for  example,  we  use  f{x\y)  instead  of  f{x;  0, 7|y). 

3     The  Iterative  Algorithm 

In  this  section,  we  develop  an  algorithm  to  estimate  the  prevalence  of  HIV.  This  is  done 
by  deriving  the  likehhood  equation  from  the  probability  density  function  (4),  and  then 
iteratively  solving  this  equation. 

Suppose  that  we  collect  blood  samples  from  nm  individuals,  pool  them  together  to  form 
n  independent  pools  of  size  m,  and  then  test  the  pools.  The  raw  data  are  x  =  (xi, . . . ,  Xn), 
where  Xj  denotes  the  OD  reading  of  the  ith  pool.  Equation  (4)  implies  that  the  overall 
likelihood  function  is 

^(^;p)  =  n  (e  {^)pHi  -  P)'"-V?''"V.))  •  (6) 

To  obtain  the  maximum  likehhood  estimate  (m.l.e.)  p  of  p,  we  solve  the  likelihood  equation 

a  log  L(x;p)  ^  ^  ^^^ 

dp 


It  is  convenient  to  define  Tk{x;p),  the  conditional  probability  that  a  pool  contains  k  infected 
individuals  given  that  its  OD  is  x, 

Tki^W)  —  /    V T — i •  \P) 

Er=o(7Hi-p)'"-vi^''"^w 

After  some  tedious  but  straightforward  algebra,  equation  (7)  reduces  to 

1     n      m  1  n      m 

-  im  kTk{x,\p)  -  ——  ^  ^(m  -  k)Tk{xi;p)  =  0.  (9) 

P  i=l  A:=0  ^        P  i=l  fc=0 

Because  12T=o'''k{^'^P)  —  !>  expression  (9)  simplifies  to  the  self-consistency  equation 

n      m 

nmp  =  ^^A;rfc(x,;j9).  (10) 

i=l  k=0 

Intuitively,  the  left  side  of  (10)  is  the  expected  number  of  infected  individuals  and  the  right 
side  is  the  conditional  expected  number  of  infected  individuals,  given  the  data. 

Motivated  by  (10),  we  propose  the  following  iterative  algorithm,  where  e  denotes  the 
desired  tolerance  level: 

Step  1.  Start  the  procedure  with  an  initial  estimate  p^°\ 
Step  2.  Obtain  improved  estimates  by  taking 

1        n      771 

p(3+i)^_L^^fcr,(xr,p(^)).  (11) 

Step  3.  Terminate  if  |p**+^^  —  P^^^\  <  e;  otherwise,  return  to  Step  2. 

To  prove  that  the  algorithm  converges,  we  show  that  it  is  an  Expectation-Maximization 
(EM)  algorithm  (see  Dempster,  Laird  and  Rubin  1977).  The  EM  algorithm  is  applicable 
whenever  the  raw  data  can  be  viewed  as  incomplete  observations  from  a  "complete"  data  set, 
and  the  maximum  hkelihood  estimates  for  the  complete  data  set  are  easily  obtainable.  Here, 
the  complete  data  set  consists  of  both  the  observable  OD  readings,  and  the  unobservable 
number  of  infected  individuals  in  the  pooled  samples.  The  details  are  described  in  Appendix 
A. 


4     Analytical  Approximation 

The  algorithm  of  Section  3  requires  the  computation  of  the  conditional  probabihties  Tk{x;  <j)) 
for  A;  =  0, . . . ,  m.  Since  we  have  been  unable  to  obtain  these  probabilities  in  closed  form, 
Monte  Carlo  simulation  is  employed  in  Subsection  7.3  to  estimate  these  probabilities.  An 
analytical  approximation  to  these  probabilities  is  derived  in  this  section;  this  approximation, 
which  assumes  large  pool  sizes,  is  much  less  computationally  intensive  than  the  simulation 
procedure,  and  is  used  in  Section  6  to  derive  eflFective  samphng  designs.  The  large  pool 
size  approximation  is  derived  in  Subsection  4.1,  and  is  employed  in  Subsections  4.2  and  4.3, 
respectively,  to  calculate  asymptotic  expansions  for  the  probability  densities  defined  in  equa- 
tions (4)  and  (5).  A  discussion  of  the  appropriateness  of  the  large  group  size  approximation 
is  deferred  to  Subsection  4.4. 

4.1     A  Large  Pool  Size  Approximation 

Let  yt*^'"*)  denote  the  conditional  antibody  concentration  given  that  the  pool  consists  of 
k  infected  and  m  —  k  non-infected  individuals;  in  this  subsection,  we  prove  that  yt*^-'"' 
converges  in  distribution  to  a  normal  random  variable  as  A;,  m  ^  oo.  Let  /i+  and  /x_  denote, 
respectively,  the  mean  antibody  concentrations  of  infected  and  non- infected  individuals,  and 
let  a+  and  a_  be  the  respective  standard  deviations.  In  addition,  define  fi{p)  —  piJ.+  +  (1  — 
p)//_. 

Proposition  1  Let  k,m  ^^  oo,  such  that  k/m  -^  s.  Then, 

y^(y(fc,m)  _  ^(^))  ^  pj  ^Q^  ^^2  ^  (1  _  ^yj^ 

Proof:   Let  Y^^  and  Y~,  i  —  1,2, .. .  denote  infinite  sequences  of  independent,  identically 
distributed  (iid)  random  variables  with  respective  densities  n+{y)  and  7r_(y).  The  definition 


of  y^*^'*")  implies 


If  we  define 


and 


y(fe,m)  _   ^1     +  ■  .  .  +  Vfc     +Yi     +  .  ■  .  +  y^_fc 


(13) 


m 

y(m)+ 

>^l"  + 

■  ■+V™i- 

sm 

(m)-  _ 

^-1-  +  . 

•  +  ^m-[sm] 

m  —  [sm] 
then  the  multivariate  Central  Limit  Theorem  (Billingsley  1987,  p.  398)  implies  that 


(14) 


iv/7^(FM--M-)j^^HUJ'l    0     ai))     ^--^^-  (15) 

Furthermore,  yt*^''")  — >  5y("^)+  -|-  (i  _  s)Y^^^~  as  m  ^  oo.  Proposition  1  now  follows 
from  the  Converging  Together  Theorem  and  the  Continuous  Mapping  Theorem  (Billingsley, 
Theorems  25.4  and  25.7).  ■ 

Therefore,  the  density  of  y^*^-'"'  is 


where  the  error  is  0(171'^)  (see  Barndorff-Nielsen  and  Cox  1989,  p.  91). 


(16) 


4.2     Approximating  fx      {^) 

In  this  subsection,  we  use  the  asymptotic  distribution  of  y^'^-'"'  to  develop  an  asymptotic 
expansion  for  fx      \^)i  which  is  defined  in  (5).  The  density  fx      {^)  can  be  expressed  as 

/]?•'"' W=  r  n\,''"'\y)f{x\y)dy.  (17) 

J —oo 

Substituting  (16)  into  (17)  and  defining  r  =  k/m  yields 

J-oo]j  2TT{ra+  + (1  -  r)at)         \    2     [ra^  +  [l  —  r)ai  J 

(18) 


As  m  increases,  the  value  of  the  integrand  in  (18)  depends  more  and  more  completely  on  the 
values  of  y  in  the  neighborhood  of  ^(r).  Hence,  we  replace  f{x\y)  by  its  Taylor  expansion 
around  /i(r), 


|(x|M0)4(.-Mr))^g' 


/(x|t/)  =  /(a:|A.(r))  +  (y-Mr))^(x|M0)  +  o(2/-Mr))'^(x|/x(r))  +  O(|y-//(r)|^).   (19) 


If  we  substitute  (19)  into  (18),  then  (16)  implies  that  E{Y^'''"'^ -n{r))  =  0{m-^),  E{Y^''-'^^  - 
^{r)f  =  '•'^++^^-'')'^-  +  0{m-^)  and  E{Y^'''"'^  -  (//(r)))^  =  0{m-^).  Thus,  we  have 

f^^''-\x)  =  fixHr))  +  ^^i  +  (l^-0^-|/(^|^(,))  +  0(^-2).  (20) 

The  calculation  of  ^-^  is  required  to  complete  the  derivation.  This  calculation  was 
performed  using  the  symbolic  differentiation  of  a  computer  algebra  system  (Maple  V)  and 
the  expression  is  given  in  Appendix  B. 

4.3     Approximating  fxi^w) 

The  methodology  of  Subsection  4.2  can  also  be  used  to  derive  an  asymptotic  expansion  for 


fx    i^'^P)^  which  is  defined  in  (4).  Let  cr(p)  =  Jp(^+  +  (1  ~  p)^'-  +  p(l  ~  p)(a*+  ~  A*-)^-  The 
Central  Limit  Theorem  implies  that 

y/^{Y(^)  -  ^{p))  ^  N  {0,a^{p))    as   m  ^  oo.  (21) 

Repeating  the  intermediate  steps  of  Subsection  4.2  yields 

/r  (x;p)  =  fixHp))  +  ^0(^Im(p))  +  0{m-').  (22) 

This  expression  is  used  in  Section  5  to  derive  an  upper  bound  for  the  asymptotic  variance 
of  the  m.l.e.  as  the  number  of  pools  n  — >  oo. 


Objective 

To  Estimate  the  Prevalence  p. 


Known  Parameters 


Raw  Data 

{xi, . . .  ,Xn}  '■  OD  readings  from  pools  of  size  m 


Approximations 


fk{x;p)  = 


dy2 

(T)p''(i-p)"'-*/x'"''w 


Step  1 
Step  2 
Step  3 


The  Algorithm 

Start  the  procedure  with  initial  estimate  p'°'. 

Terminate  if  \p'-^'^^^  —  p'^^l  <  e; 
otherwise,  return  to  Step  2. 


Figure  1:  The  estimation  algorithm  for  the  proposed  estimate. 

4.4     Discussion  of  the  Approximation 

The  asymptotic  representation  for  fx  i^)  can  be  used  to  approximate  Tk{x;p)  and  imple- 
ment the  estimation  algorithm  (see  Figvue  1).  However,  it  is  expected  that  the  algorithm 
will  converge  to  a  fixed  point  that  will  be  different  from  the  m.l.e.  To  distinguish  this  fixed 
point  from  the  m.l.e.,  we  call  it  the  proposed  estimate  and  denote  it  by  p.  Using  (20),  we 
can  show  that  the  difference  between  the  proposed  estimate  and  the  m.l.e.  is  of  Oim^"^)] 
however,  the  proof  is  omitted  here  and  can  be  found  in  Zenios  (1995). 

Regarding  the  appropriateness  of  the  large  group  size  approximation,  we  note  that  the 
empirical  results  in  Wein  and  Zenios  display  a  large  separation  between  OD  readings  for  HIV 
positive  and  HIV  negative  individuals.  Consequently,  our  parametric  procedure  can  adopt 
very  large  pool  sizes  without  significant  loss  of  information,  thereby  partially  justifying  our 
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large  group  size  approximation.  However,  the  accuracy  of  the  asymptotic  expansion  (20) 
depends  on  the  rate  that  y('"'+  and  Y^'^^'  converge  to  normal  random  variables  in  (15). 
Following  the  conventional  rule-of-thumb  (the  sample  mean  of  n  iid  random  variables  is 
approximately  normal  for  n  >  15),  we  conclude  that  if  the  pool  size  m  and  the  number  of 
infected  individuals  in  the  pool,  k,  satisfy  15  <  A;  <  m  —  15,  then  the  normal  approximations 
should  be  sufficiently  accurate  for  practical  purposes. 

The  condition  ^  <  m  —  15  is  apt  to  hold  for  large  pool  sizes.  Although  A;  >  15 
may  be  practical  for  other  statistical  applications,  we  would  expect  k  to  be  much  smaller 
than  15  for  HIV  prevalence  studies.  Approximation  (15)  would  hold  for  small  k  if  TT+{y)  was 
approximately  normal;  however,  this  does  not  appear  to  be  the  case  (see  Figure  7  of  Wein  and 
Zenios).  In  conclusion,  we  do  not  have  a  persuasive  justification  for  (15);  nevertheless,  the 
large  group  size  approximation  developed  in  this  section  performs  well  in  the  computational 
study  described  in  Section  7. 

5     An  Upper  Bound  on  the  Asymptotic  Variance 

This  section  contains  a  derivation,  which  is  based  on  the  information  inequality,  of  an  upper 
bound  on  the  variance  of  the  proposed  estimate.  Let  /i^'"^(p)  =  E{X^'^^\p),  and  let  /i(^('")(p)) 
and  /i(p)  denote,  respectively,  the  Fisher's  information  about  /i^'"'(p)  and  p  contained  in  a 
single  observation  from  X^"^\  which  is  the  OD  reading  of  a  pool  of  size  m.  Because  X^'"^  is 
an  unbiased  estimate  of  /x*'"'(p),  the  Cramer- Rao  lower  bound  implies  that 

Since  Var(p),  which  is  the  asymptotic  variance  of  p  as  n  — >  oo,  equals  l/(n/i(p)),  it  follows 
that 

V„tf)  <  ^-(^'"'Iri,  (24) 
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Let  us  first  calculate  //('"'(p).  Prom  (22),  we  have 

/x('")(p)  =  jf     x/(x|Mp))cix  +  ^  ^    x^(x|Mp))cix  +  0(m-2).  (25) 

Equation  (3)  implies 

/     xj{x\y)dx=-^  (26) 


,7 

y 

and  difl^erentiating  twice  under  the  integral  sign  yields 


Thus,  substituting  (26)  and  (27)  into  (25)  gives 

l  +  Mf))^        2m  {l  +  (/i(p)n'  ^        ;       V     ; 

Differentiating  (28)  yields 


dp      '"■'  (1  +  (/X(p))^)2 

^^(P)  /7(7  +  1)  (A^(P))'""^  -  47^  (/^(P))'""'  +  7(7  -  1)  (/^(p))""\^ 
^  2m\  (1  +  (;,(p))^)^ 

(7  +  1)^  (M(P))'^-'  -  (47^  -  2)  (/x(p))^^-^  -  (7  +  1)^(m(p))^-^^  , 

(1  +  (m(p))')' 

(7  +  1)(7  +  2)  (/x(p))'^-'  -  (47^  -  4)  (m(p))'^-'  -  (7  +  1)(7  +  2)  (/i(p))'^-' 


C    , (29) 


(i  +  (Mp))^)^ 

where  ^4  =  /x_  -  //+,  B  =  a^  -  a'i  +  n^.  -  ^t  and  C  =  cr^^+  -  o-^^_  +  /i+/i?.  -  /i+A*-- 
Carrying  out  a  similar  analysis  for  Vax(X^'"'|p)  leads  to 

V.r(X(-)M        ,.        (/^(P))'  ^  ^'(P)   f'/'(7-l)(Mp)r^ 

(2  +  (40  -  6)7  +  27^)  {MY'-'  +  (7  -  2)(0  -  2  +  27)(/^(p))^^-^>^  .3Q. 

For  brevity  of  notation,  let  h  be  the  first  term  in  the  right  side  of  equation  (29)  and  /  be 
the  coefficient  of  ^  in  (29).  Similarly,  let  v  be  the  first  term  in  the  right  side  of  (30),  and  w 
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be  the  coefficient  of  —  in  (30).  Substituting  (29)  and  (30)  into  (24)  and  omitting  terms  of 
0{m~^),  we  have  the  upper  bound 

V  +  — 
Var(p)  <  ^       +  0{m-').  (31) 

n  (h^  +  ^) 

We  conclude  this  section  with  a  brief  discussion  of  confidence  intervals.   The  Wald's 

1  —  a  confidence  interval  for  p  is 

p±^M=,  (32) 

\/nh{p) 

where  Za/2  is  the  upper  a/2  quantile  of  the  standard  normal  density.  Because  a  closed 
form  expression  for  /i(p)  is  not  available,  we  could  attempt  to  use  the  observed  information 

°dpi        instead.    Although  the  details  are  omitted,  the  resulting  confidence  intervals 

were  tested  in  the  computational  study  in  Section  7,  and  were  not  sufficiently  reliable  for 
practical  use  (e.g.,  the  actual  covering  probabihty  for  the  95%  confidence  interval  was  roughly 
85%).  Alternatively,  we  could  construct  confidence  intervals  that  use  the  upper  bound  (31). 
However,  this  gives  a  covering  probability  that  is  roughly  99%. 

6     Designing  Efficient  Procedures 

To  conduct  a  population  survey  that  adopts  the  algorithm  of  Figure  1,  we  must  first  specify 
the  number  of  pools  n  and  the  group  size  m.  An  efficient  design  must  balance  the  testing 
cost  with  the  acciuracy  of  the  resulting  estimate.  The  design  problem  that  arises  in  practice 
can  be  posed  as  one  of  the  following  two  constrained  optimization  problems:  either  choose  n 
and  m  to  achieve  a  prespecffied  variance  at  minimum  cost,  or  choose  n  and  m  to  minimize 
the  variance  of  the  estimate  subject  to  a  prespecified  budget  constraint.  These  two  problems 
are  closely  related  (more  about  that  later),  and  for  concreteness  we  will  address  the  former 
problem. 
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Let  A  denote  the  prespecified  variance  threshold.  Since  the  true  variance  of  the  esti- 
mate is  not  available  in  closed  form,  we  employ  the  upper  bound  of  the  asymptotic  variance 
derived  in  (31)  as  a  smrrogate.  Oiu  resulting  design  will  be  conservative,  in  that  the  actual 
variance  will  be  less  than  or  equal  to  A.  Of  course,  the  upper  bound  on  the  variance  is  a 
function  of  the  prevalence,  which  must  be  estimated  at  the  design  stage. 

We  assume  that  the  cost  of  testing  a  pool  of  size  m  is  (see  Behets  et  al.  1988  for  a 
detailed  discussion) 

C('-)  =  {'     ^,    '!"":;■  (33) 

[  em  +  f    if  m  >  1, 

where  e  and  /  are  positive  constants.  The  design  problem  is  to  choose  positive  integers  n 
and  m  to 

minimize   nC{m)  (34) 

subject  to  n  >  — -, "^  .  ,  .  (35) 

-A(/^^  +  f) 

It  is  possible  to  solve  this  integer  programming  problem  using  exhaustive  search.  However, 
this  is  cumbersome  and  time  consuming.  Instead,  we  obtain  a  closed  form  solution  to  the 
non-integer  relaxation  of  this  problem. 

Proposition  2  //  {wh?  -  2hlv){fh'^  —  2ehl)  >  0,  then  the  solution  to  the  non-integer  relax- 
ation of  (34)- (35)  IS 


(_r,,.    ,       /  {wh'2-2hlv){fh^  -2ehl)  \ 
0,       ^^      ^, ,  (36) 

?;  -I-  ^^ 
n*  =       ,^'    ,.  (37) 

A(/.^  +  ^) 

The  proof  involves  a  standard  Lagrangian  argument  and  elementary  calculus,  and  is  omitted. 

Similarly,  we  can  solve  the  second  optimization  problem,  which  is  to  choose  m  and  n 

to  minimize  the  upper  bound  of  the  asymptotic  variance  subject  to  nC{m)  <  B,  where  B  is 
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the  prespecified  available  budget.  The  solution  to  the  integer  relaxation  of  this  problem  is 

given  by  (36)  and 

em*  +  f  ,     ^ 

n*  =  — ^.  (38) 

Notice  that  the  pool  size  in  (36)  is  independent  of  the  specified  variance  threshold 
and  the  imposed  budget  constraint.  It  can  be  shown  that  this  pool  size  also  minimizes  the 
cost  per  unit  information  (i.e.,  cost  times  variance).  Although  the  design  in  Proposition  2 
employs  an  upper  bound  on  the  asymptotic  variance  and  is  clearly  suboptimal,  this  result 
should  provide  a  useful  back-of-the-envelope  calculation. 

7     Simulation  Study 

The  results  from  a  Monte  Carlo  simulation  study  are  reported  in  this  section.  The  purpose 
of  this  study  is  to  address  the  following  questions: 

1.  How  accurate  are  the  approximations  of  Section  4? 

2.  What  is  the  value  of  employing  the  continuous  OD  readings  rather  than  the  traditional 
dichotomous  test  results? 

3.  Is  it  necessary  to  exphcitly  incorporate  the  dilution  effect  into  our  analysis? 

4.  What  are  the  expected  benefits  from  pooled  testing? 

The  first  question  is  addressed  by  comparing  the  proposed  parametric  estimator  (p.p.e) 
to  the  exact  m.l.e.  To  answer  question  2,  we  develop  a  class  of  binary  estimators,  and  compare 
the  best  binary  estimator  to  the  exact  m.l.e.  To  address  question  3,  we  compare  the  p.p.e. 
to  two  other  binary  estimators,  one  that  explicitly  captures  the  dilution  effect  and  one  that 
ignores  it.  We  also  consider  individual  testing  in  order  to  provide  an  answer  to  question  4. 
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To  obtain  meaningful  results  from  the  simulation  study,  we  employ  real  data.  The  data 
are  described  in  Subsection  7.1  and  are  used  in  Subsection  7.2  to  estimate  the  parameters  of 
the  simulation  model.  The  various  estimators  are  defined  in  Subsection  7.3,  the  results  of  the 
simulation  study  are  reported  in  Subsection  7.4  and  a  hypothetical  application  is  described 
in  Subsection  7.5. 

7.1  Data 

We  employ  two  data  sets  that  were  collected  and  tested  by  the  National  Reference  Laboratory 
of  Australia  (NRL).  The  NRLl  data  set  consists  of  the  OD  readings  for  4000  HIV  negative 
and  3000  HIV  positive  sera.  The  sera  were  tested  using  a  commercially  available  test  kit, 
and  the  cutoff  classifying  the  test  outcomes  as  positive  or  negative  was  0.05.  The  NRL2  data 
set  consists  of  OD  readings  from  10  infected  sera,  diluted  sequentially  to  a  fixed  negative 
serum  to  produce  a  series  of  10  two-fold  dilutions,  with  ratios  1  :  16, 1  :  32, . . . ,  1  :  8192.  The 
samples  were  tested  in  duplicate  using  10  different  test  kits.  No  individuals  are  common 
to  the  two  data  sets,  and  the  test  kit  in  NRLl  is  not  one  of  the  10  kits  used  in  NRL2. 
Finally,  we  note  that  all  individual  samples  in  NRLl  were  diluted  according  to  the  test  kit 
manufacturer's  instructions  before  being  tested. 

7.2  The  Simulation  Model 

The  simulation  model  randomly  generates  antibody  levels  and  OD  readings  for  both  individ- 
ual and  pooled  samples  via  the  hierarchical  model  defined  in  Section  2.  In  this  subsection, 
we  specify  the  model  parameters  (f)  and  7,  and  the  densities  7r+(y)  and  i^-iy). 

The  parameters  0  and  7  are  solely  dependent  on  the  test  kit  employed,  and  can  be 
estimated  from  dilution  series  data  by  fitting  the  data  to  a  generalized  linear  model;  Wein 
and  Zenios  carry  out  this  procedure  for  the  10  different  test  kits  in  NRL2.  The  ten  estimated 
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values  for  7  (one  for  each  test  kit)  ranged  from  0.09  to  1.11,  and  averaged  0.54.  Evidently, 
the  value  for  7  is  highly  dependent  on  the  test  kit  employed.  Unfortunately,  we  do  not  have 
dilution  series  data  for  the  test  kit  employed  in  the  NRLl  data  set,  which  is  the  data  set 
used  to  estimate  7r+(j/)  and  7r_(y).  For  simplicity,  we  use  the  value  7  =  1.0  in  the  simulation 
model.  Sensitivity  analysis  in  Wein  and  Zenios  shows  that  7  =  0.54  and  7  =  1.0  yield 
qualitatively  similar  results. 

We  employ  the  parameter  0  and  the  densities  7r+(t/)  and  7r_(y)  that  are  used  in  Wein 
and  Zenios,  and  readers  are  referred  to  that  paper  for  details;  here  we  provide  only  a  brief 
summary.  Rather  than  estimate  0  in  a  manner  similar  to  the  estimation  of  7,  Wein  and  Zenios 
obtain  the  estimate  (f)  =  0.0088  as  a  by-product  of  estimating  7r_(y)  from  the  NRLl  data.  The 
densities  TT+{y)  and  7r_(y)  are  difficult  to  obtain  because  the  antibody  concentration  is  not 
an  observable  quantity.  Moreover,  these  densities  are  population  dependent  and  may  vary 
if  the  epidemic  is  nonstationary.  However,  if  the  epidemic  is  progressing  slowly,  then  these 
densities  can  safely  be  assumed  to  be  constant,  and  hence  do  not  have  to  be  re-estimated 
every  time  a  population  survey  is  conducted. 

Using  a  combination  of  exploratory  data  analysis  and  the  EM  algorithm  to  estimate 
7r_(j/)  from  the  4000  OD  readings  for  HIV  negative  individuals  in  NRLl,  Wein  and  Zenios 
conclude  that  the  variabihty  of  OD  readings  for  HIV  negative  individuals  is  due  nearly 
entirely  to  within  sample  variability  and  contains  almost  no  between  sample  variability. 
Consequently,  for  TT-{y)  we  use  the  degenerate  density  with  mean  //_  =  0.0086  and  cr_  =  0.0. 
Exploratory  data  analysis  in  Wein  and  Zenios  suggests  that  the  within  sample  variability 
is  substantially  larger  than  the  between  sample  variabihty  for  HIV  positive  individuals, 
and  they  make  the  simplifying  assumption  that  the  within  sample  variability  is  zero  for 
infected  individuals.    Thus,  X\Y  =  E{X\Y)  for  infected  individuals,  and  combining  this 
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Figure  2:  The  density  iT+{y)  used  in  the  simulation  model. 


with  equation  (3)  we  obtain 


Y  = 


X\Y 


(39) 


^l-X\YJ 

Since  our  estimate  for  7  is  one,  if  we  let  Xi, . . . ,  a; 3000  denote  the  observed  OD  readings  for 
the  3000  infected  individuals  in  the  NRLl  data  set,  then  the  density  TT+iy)  is  specified  by 
the  empirical  density  of  (yr^)  ,  •  ■  • ,  (i-xToo)'  '^^^^  density  is  displayed  in  Figure  2.  For 
ease  of  reference,  the  parameters  of  the  simulation  model  are  provided  in  Table  1. 


7.3     Estimation  Procedures 

In  this  subsection,  we  describe  the  procedures  that  are  considered  in  this  simulation  study: 
the  proposed  parametric  estimator,  a  numerical  procedure  for  the  exact  m.l.e.,  three  distinct 
binary  estimators  that  employ  binary  test  results  rather  than  continuous  OD  readings,  and 
individual  testing. 
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/i_  =  0.0086 

/x+  =  3.79 

cr_  =  0.00 

cr+  =  4.00 
(t)  =  0.0088 

7  =  1.00 

Table  1:  The  parameters  of  the  simulation  model. 

Parametric  Estimators.  The  proposed  parametric  estimator  employs  the  algorithm 
in  Figure  1  with  the  estimates  in  Table  1.  The  p.p.e.  employs  the  pool  size  given  by 
Proposition  2.  The  cost  parameters  are  obtained  from  the  field  study  of  Behets  et  al.,  and 
are  e  =  0.04  and  /  =  1.35;  details  may  be  found  in  Wein  and  Zenios.  Figure  2  displays  the 
derived  pool  size  m*  for  0  <  p  <  1.  Notice  that  m*  >  80  for  p  <  0.22.  The  hierarchical 
pooling  model  was  vahdated  on  pools  of  size  at  most  80  in  Wein  and  Zenios  and,  in  fact, 
80  appears  to  be  the  largest  pool  size  reported  in  ihe  literature  (Gaboon- Young  1992). 
Consequently,  we  do  not  consider  pool  sizes  larger  than  80. 

To  derive  the  exact  m.l.e.,  we  use  the  algorithm  in  Figure  1  with  f^  {x;  (f))  replaced  by 
a  lookup  table  representation  of  the  density.  The  table  entries  are  obtained  from  a  simulated 
data  set  as  follows.  For  each  value  of  k  and  m,  a  sample  of  3000  pools  that  each  consist 
of  k  infected  and  m  —  k  non-infected  individuals  is  generated  from  the  antibody  densities 
7r+(t/)  and  7r_(y).  The  OD  readings  are  generated  using  the  density  in  (3),  and  the  table 
representation  is  obtained  from  the  empirical  density  of  the  simulated  OD  readings. 

When  embedded  in  the  estimation  algorithm  in  Figure  1,  these  representations  produce 
an  estimate  that  is  approximately  equal  to  the  m.l.e.  We  call  this  estimator  the  theoreticai 
parametric  estimator  (t.p.e.),  and  it  provides  the  benchmark  to  which  the  performance  of 
the  p.p.e.  is  to  be  compared.  The  t.p.e.  also  uses  the  pool  sizes  shown  in  Figure  2. 

Binary  Estimators:  To  assess  the  value  of  employing  continuous  OD  readings  rather 
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than  traditional  binary  test  results,  we  also  consider  the  following  class  of  procedures:  test  n 
pools  of  size  m,  classify  each  pool  outcome  as  HIV  negative  or  positive  using  the  OD  cutoff 
M,  and  estimate  the  prevalence  from  the  total  number  of  positive  pools.  Let  Se(m,  u)  and 
Sp(m,  u)  denote  the  sensitivity  and  specificity  for  this  class  of  pooling  procedures,  and  A;  be 
the  total  number  of  positive  pools.  The  prevalence  estimate  is 

-—1       f        Se{Tn,u)  —  k/n        \ 

^"     ~  l,Se(m,M)  +  Sp(m,«)-lj       '  ^     ' 

and  the  asymptotic  variance  of  p  is 

Var(p)  =  ^(1  -  P)'-'"/';-"'^''^-/'"';"'*;".  (41) 

nm^  (Se(m,u)  +  Sp(m,u)  —  1)^ 

where  f{m,u,p)  =  [1  —  (1  —  p)'"]Se(m,M)  +  (1  —  p)'"(l  —  Sp(m,  u)).  Also,  the  cost  per  unit 
information  is 

m'^  {be{m,u)  +bp[m,u)  —  1)'' 

We  consider  three  different  bimary  estimators  that  are  based  on  (40).  The  first  estima- 
tor, called  the  simple  binary  estimator  (s.b.e),  assumes  that  that  Se(m,  u)  and  Sp{m,u)  are 
not  a  function  of  m  (i.e.,  there  is  no  dilution  effect),  and  that  the  cutoff  u  is  determined  by  the 
test  manufacturer.  Consequently,  for  all  m,  Se(m,  u)  and  Sp(m,  u)  represent  the  sensitivity 
and  specificity  under  individual  testing,  as  reported  by  the  test  manufacturer.  The  pool  size 
m  is  chosen  to  minimize  the  cost  per  unit  information  (42);  recall  that  the  pool  size  derived 
in  Section  5  also  minimized  this  quantity.  The  simple  binary  estimator  was  proposed  by  Tu 
et  al.  who  were  the  first  to  derive  equations  (40)-(41).  To  implement  this  procedure,  we 
set  u  equal  to  0.05,  which  is  the  cutoff  for  the  NRLl  data,  and  use  Monte  Carlo  simulation 
(with  It  =  0.05  and  m  =  1)  to  derive  very  precise  estimates  of  the  sensitivity  and  specificity 
under  individual  testing;  these  two  estimates  are  substituted  into  equations  (42)  and  (40)  to 
perform  the  computations  described  above. 
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Figure  3:  The  derived  pool  size  m*. 

The  other  two  binary  estimators  exphcitly  incorporate  the  dilution  effect,  and  hence 
are  expected  to  outperform  the  simple  binary  estimator.  Both  estimators  use  the  same  OD 
cutoff  u  and  pool  size  m;  the  estimators'  differentiating  characteristics  will  be  defined  after 
we  show  how  these  two  parameters  are  calculated.  As  with  the  s.b.e.,  u  and  m  are  chosen  to 
minimize  (42).  However,  the  quantities  Se(m,  u)  and  Sp(m,  u)  in  (42)  are  difficult  to  estimate 
for  a  testing  procedure  that  deviates  from  the  manufacturer's  instructions;  therefore,  we 
develop  approximate  closed  form  expressions  for  these  quantities  in  order  to  find  the  cost- 
minimizing  values  of  u  and  m.  These  expressions  are  obtained  from  the  simplified  pooling 
model  of  Wein  and  Zenios.  This  model  assumes  that  the  OD  reading  X  and  the  antibody 
concentration  Y  are  related  via  In  (  y^)  —  7  ln(y)+e,  where  e  is  a  Gaussian  random  variable 
with  mean  zero  and  variance  a^.  Proposition  1  and  the  simplified  pooling  model  lead  to  an 
asymptotic  approximation  for  the  probabiUty  density  of  In  {-[ZxW^)^  where  X^*^-'"'  is  the 
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conditional  OD  reading  for  a  pool  of  rn  given  that  it  contains  k  infected  individuals.  This 
approximation  can  be  used  to  derive  the  expressions: 


Se(m,u)  = 


u-7ln(^//+  +  (l-^K) 


l-(l-p)' 


1  ,2    kal  +  {m-k)cri 


(43) 


'u  —  7ln(/x_) 


Sp(m,  ^.)  =  $    ^:^^:^    .  (44) 


'a2 


In  (43)-(44),  we  set  7  equal  to  1,  and  use  the  approach  in  Subsection  7.1  of  Wein  and  Zenios 
to  estimate  a^  and  modify  //+  and  a+  so  that  a  conservative  estimate  for  the  left  tail  of 
TT+{y)  is  obtained.  The  resulting  estimates  are  a  =  0.42,  ^+  =  2.732  and  a+  =  1.3032.  The 
approximations  (43)  and  (44)  are  substituted  into  equations  (41)  and  (42)  in  steps  2  and  3, 
respectively,  of  the  following  algorithm: 

Step  1.  Set  m  =  1. 

Step  2.  Use  steepest  descent  to  obtain  a  cutoff  u{m)  that  achieves  a  local  minimum  of 
Var(p). 

Step  3.  If  F{Tn,u{m))  >  F{m  —  l,u{m  —  1))  or  m  >  80,  stop;  otherwise,  set  m  <—  m  +  1 
and  return  to  Step  2. 

Upon  termination,  this  algorithm  gives  a  locally  optimal  procedure  that  satisfies  the  con- 
straint m  <  80.  Table  2  displays  the  resulting  pool  sizes  and  cutoffs  for  the  seven  prevalence 
values  that  are  considered  in  the  simulation  study  of  the  next  subsection. 

Notice  that  the  binary  estimate  p  in  (40)  is  a  function  of  Se(m,  u)  and  Sp(m,  u),  where 
m  and  u  are  derived  from  the  three-step  algorithm  described  above.  Our  two  binary  esti- 
mators differ  by  how  they  approximate  Se(m,  u)  and  Sp(m,  u)  in  (40).  The  proposed  binary 
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0.1 

66 

0.5 

62 

1 

54 

5 

20 

10 

12 

15 

8 

20 

7 

p(xlO^)     Pool  Size     Cutoff        Se           Sp  Se  Sp 

0.026  0.8704  0.996  0.7385  0.9812 
0.022  0.9407  0.988  0.7804  0.9425 
0.021  0.9671  0.9858  0.8562  0.9362 
0.027  0.9978  0.9974  0.9361  0.9857 
0.033  0.9996  0.9992  0.9680  0.9975 
0.038  0.9998  0.9999  0.9780  0.9993 
0.042  0.9999  0.9999  0.9804  1.0000 

Table  2:  Optimal  binary  pooling  procedures. 


estimator  (p.b.e.)  uses  equations  (43)-(44)  to  approximate  these  two  functions;  these  ap- 
proximate estimates  for  the  sensitivity  and  specificity  are  reported  inthe  foiurth  and  fifth 
columns  of  Table  2.  We  also  used  Monte  Carlo  simulation  to  derive  precise  estimates  of 
Se(m,  w)  and  Sp{m,u).  In  this  case,  equation  (40)  gives  the  best  possible  binary  estimator, 
which  we  call  the  theoretical  binary  estimator.  The  simulated  estimates  for  Se{m,  u)  and 
Sp(m,  u)  axe  reported  in  the  last  two  columns  of  Table  2. 

The  three  binary  estimators  serve  different  purposes.  The  theoretical  estimator  is  very 
tedious  to  implement  in  practice  and  is  only  included  to  answer  question  2.  In  contrast, 
both  the  simple  and  the  proposed  binary  estimators  are  rather  easy  to  implement  and  can 
be  used  to  address  questions  3  and  4. 

Individual  Testing:  Finally,  to  assess  the  value  of  employing  pooled  testing  rather 
than  individual  testing,  we  consider  the  individual  estimator,  which  is  identical  to  the  simple 
binary  estimator,  except  that  we  choose  m  =  1. 

7.4     Simulation  Results 

In  this  subsection,  we  report  the  simulation  results  from  three  experiments.  First  we  com- 
pare the  p.p.e.    to  the  theoretical  parametric  estimator,  then  we  compare  the  theoretical 
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parametric  estimator  to  the  theoretical  binary  estimator,  and  finally  we  compare  the  p.p.e., 
the  p.b.e.,  the  s.b.e.  and  the  individual  estimator. 

Experiment  1:  A  sample  of  80,000  antibody  concentrations  is  generated  from  the 
probability  mixture  pK^{y)  +  (1  —  p)7r_(y).  The  simulated  blood  samples  are  divided  into 
n  =  1000  pools  of  size  m  =  80,  and  the  OD  readings  are  generated  using  the  conditional 
density  (3).  The  p.p.e.  and  the  theoretical  parametric  estimator  are  obtained  under  seven 
different  scenarios  of  varying  prevalence,  and  the  experiment  is  repeated  400  times  for  every 
scenario.  The  simulation  model  was  implemented  in  C,  on  a  Sun  Sparc  Station  20. 

For  the  p.p.e.  p,  the  estimator  mean  E{p\p)  and  the  estimator  variance  Var(p|p)  are 
obtained  from  the  simulation  experiment.  We  also  compute  the  relative  bias 

rbias(p)  =  l:i3M^  (45) 

P 

the  relative  variance  error 

\^Y""H  -  Var(pb) 

revar(p)  =  — ^^ (46) 

P 

and  the  mean  squared  error  (mse),  E[{p  —  p)'^\p].  The  analogous  quantities  are  also  deter- 
mined for  the  theoretical  parametric  estimate  p. 

The  results  are  given  in  Table  3.  Not  surprisingly,  the  theoretical  parametric  estimator 
is  unbiased.  In  contrast,  the  p.p.e.  is  biased,  and  tends  to  understate  the  true  prevalence. 
The  relative  bias  decreases  as  the  prevalence  increases.  Also,  the  relative  variance  error  is 
positive,  which  confirms  that  (31)  provides  an  upper  bound  on  Vax(p|p). 

Experiment  2:  Once  again,  80,000  (different)  simulated  blood  samples  are  generated 
for  the  seven  scenarios  in  Table  2.  The  blood  samples  axe  divided  into  pools  of  size  80  to 
obtain  the  theoretical  parametric  estimator.  In  addition,  the  same  blood  samples  are  divided 
into  pool  sizes  given  in  Table  2  to  derive  the  theoretical  binary  estimator. 
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Prevalence 

Performance  Measures 

Theoretical 

Proposed 

0.001 

mean 

0.00096 

0.00085 

rbias  (%) 

0 

15 

mse  xlO^ 

0.02 

,      0.04 

variance  x 

;10« 

0.02 

0.02 

revar  (%) 

34 

39 

0.005 

mean 

0.0050 

0.0046 

rbias  (%) 

0 

8 

mse  xlO^ 

0.09 

0.27 

variance  x 

10« 

0.09 

0.10 

revar  (%) 

39 

31.8 

0.01 

mean 

0.01 

0.0094 

rbias  (%) 

0 

5.7 

mse  xlO^ 

0.20 

0.60 

variance  x 

10« 

0.20 

0.23 

revar  (%) 

31 

19 

0.05 

mean 

0.05 

0.048 

rbias  (%) 

0 

3.6 

mse  xlO^ 

1.25 

4.20 

variance  x 

10^ 

1.25 

1.23 

revar  (%) 

14 

15 

0.1 

mean 

0.10 

0.097 

rbias  (%) 

0 

3.2 

mse  xlO^ 

2.8 

12.9 

variance  x 

10« 

2.76 

2.74 

revar  (%) 

6 

7 

0.15 

mean 

0.149 

0.146 

rbias  (%) 

0 

3.0 

mse  xlO^ 

4.50 

24.2 

variance  x 

10^ 

4.22 

4.05 

revar  (%) 

7 

10 

0.2 

mean 

0.20 

0.194 

rbias  (%) 

0.1 

2.8 

mse  xlO^ 

6.10 

38.30 

variance  x 

10^ 

6.05 

5.74 

revar  (%) 

2 

7.4 

Table  3:  Comparison  of  the  theoretical  and  proposed  parametric  estimates. 
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Confidence  Intervals 

Relative 

Relative 

p(xl02) 

Paxametric  (xlO^) 

Binary  (xlO^) 

Efficiency 

Cost  Efficiency 

0.1 

0.096  ±  0.0028 

0.104  ±  0.0032 

1.590 

1.395 

0.5 

0.500  ±  0.0059 

0.530  ±  0.0069 

1.757 

1.479 

1 

1.000  ±  0.0088 

0.993  ±  0.0093 

1.673 

1.289 

5 

5.000  ±  0.0219 

5.114  ±  0.0255 

5.401 

2.551 

10 

10.000  ±  0.0328 

10.041  ±  0.0298 

5.495 

2.217 

15 

14.900  ±  0.0416 

15.000  ±  0.0364 

7.633 

2.808 

20 

20.000  ±  0.0484 

20.080  ±  0.0468 

10.639 

3.831 

Table  4:  Comparison  of  the  theoretical  parametric  and  binary  estimators. 

Table  4  contains  a  comparison  of  the  theoretical  parametric  and  binary  estimators.  The 
first  column  of  Table  4  gives  the  prevalence  and  the  next  two  columns  contain  95%  confidence 
intervals  for  the  prevalence  estimates  under  the  parametric  and  binary  procedures.  The  third 
column  shows  the  relative  efRciency  of  the  parametric  with  respect  to  the  binary  estimator, 
and  the  fourth  column  gives  the  corresponding  relative  cost  efRciency;  these  two  quantities 
are  defined  as  follows:  Suppose  that  the  parametric  estimator  uses  a  pool  of  size  m'  and 
requires  n'  observations  to  attain  a  predetermined  variance  r^.  Furthermore,  suppose  that 
the  binary  estimator  uses  a  pool  of  size  m  and  requires  n  observations  to  attain  t'^.  The 
relative  efficiency  is  the  ratio  n'/n,  and  the  relative  cost  efficiency  is  n'C{m')/{nC{m)). 

Table  4  shows  that  both  estimators  are  unbiased  and  their  variances  are  comparable. 
However,  because  the  parametric  estimator  adopts  a  larger  pool  size,  both  the  efficiency 
and  cost  efficiency  ratios  are  always  greater  than  one.  Therefore,  the  parametric  estimator 
can  achieve  the  same  variance  as  the  binary  estimator  for  a  smaller  sample  size  and  testing 
cost.  Furthermore,  the  benefits  from  the  theoretical  parametric  estimator  appear  to  be  more 
substantial  when  the  prevalence  is  high.  Intuitively,  when  the  prevalence  is  high,  the  binary 
estimator  is  forced  to  adopt  moderate  pool  sizes  in  order  to  minimize  the  probability  that  a 
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pool  contains  more  than  one  infected  individual;  pools  that  contain  more  than  one  infected 
individual  hide  information  on  the  true  prevalence.  In  contrast,  the  theoretical  parametric 
estimator  has  the  flexibility  to  employ  substantially  larger  pool  sizes  because  it  can  infer 
the  total  number  of  infected  individuals  in  each  pool  from  the  OD  reading.  Therefore,  the 
parametric  estimator  can  adopt  a  substantially  larger  pool  size  and  still  achieve  the  same 
variance  as  the  binary  procedure.  This  situation  changes  when  the  prevalence  is  low  because 
then  the  optimal  group  size  for  the  binary  estimator  is  close  to  80,  and  the  group  size  for  the 
parametric  procedure  is  exactly  80;  recall  that  we  have  introduced  the  restriction  m  <  80. 
Hence,  the  upper  bound  restricts  the  flexibihty  of  the  parametric  procedure  and  forces  it  to 
behave  like  the  binary  procedure. 

This  experiment  confirms  our  intuition  that  the  continuous  OD  readings  can  be  used 
to  produce  very  precise  estimates.  Unfortunately,  the  estimators  in  this  experiment  are  hard 
to  employ  in  practice.  In  the  next  experiment  we  use  more  practical  estimators  that  contain 
several  approximations.  Although  these  estimators  are  expected  to  be  biased,  we  will  see 
that  the  bias  is  offset  by  their  high  efficiency,  and  the  procedures  that  utihze  the  hierarchical 
pooling  model  produce  very  precise  estimates. 

Experiment  3:  Once  again,  80,000  (different)  simulated  blood  samples  are  generated 
for  the  seven  scenarios  in  Table  2.  The  blood  samples  are  divided  into  pools  of  size  80  to 
obtain  the  proposed  parametric  estimator.  The  same  blood  samples  are  also  divided  into  the 
pool  sizes  given  in  Table  2  to  derive  the  proposed  binaxy  estimator.  In  addition,  these  blood 
samples  are  divided  into  the  pool  sizes  dictated  by  the  simple  binary  procedure  in  order  to 
derive  the  s.b.e.  Finally,  the  samples  are  also  tested  individually  to  calculate  the  individual 
estimator. 

Table  5  gives  the  mean  squared  error  (mse)  for  the  four  estimators.  We  observe  that 
the  mse  for  the  simple  and  individual  estimators  are  from  4  to  40  times  larger  than  the  mse 
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P(X102) 

Individual 

s.b.e. 

p.b.e. 

p.p.e. 

0.1 

1.000 

0.850 

0.028 

0.046 

0.5 

4.000 

10.464 

0.073 

0.216 

1 

9.000 

35.705 

0.465 

0.471 

5 

59.983 

91.113 

14.278 

4.082 

10 

123.936 

79.731 

30.211 

10.902 

15 

190.891 

74.199 

38.991 

20.606 

20 

220.886 

60.135 

59.618 

38.354 

Table  5:  Comparison  of  the  mean  squared  errors  (xlO^)  of  the  individual,  simple  binary, 
proposed  binary  and  proposed  parametric  estimators. 

of  the  two  new  estimators.  By  faiUng  to  incorporate  the  dilution  effect,  the  simple  binary 
estimator  introduces  a  substantial  bias  that  dramatically  inflates  the  mse.  Although  the 
individual  estimator  is  the  only  unbiased  estimator  of  the  four,  its  variance  is  much  larger 
than  the  estimators  that  capture  the  dilution  effect.  Hence,  pooled  testing,  if  used  properly, 
can  provide  very  precise  estimates  for  the  prevalence  of  HIV. 

Now  we  compare  the  two  practical  estimators:  the  p.p.e  and  the  p.b.e.  Table  6  gives 
95%  confidence  intervals  for  the  prevalence  estimates  together  with  the  relative  efficiency  of 
the  p.p.e.  with  respect  to  the  p.b.e.,  and  the  relative  cost  efficiency.  Surprisingly,  the  binary 
procedure  outperforms  the  parametric  procedure  when  the  prevalence  is  less  than  or  equal 
to  0.01.  Table  4  shows  that,  at  least  in  theory,  the  parametric  procedure  should  be  more 
efficient  than  the  binary  procedure.  However,  its  relative  efficiency  in  this  table  is  rather 
small  when  p  <  0.01.  It  appears  that  the  proposed  approximations  are  more  accurate  in 
the  binary  setting  than  the  parametric  setting  (at  least  when  the  prevalence  is  small),  and 
this  improved  accuracy  more  than  offsets  the  difference  in  Table  4  when  the  prevalence  is 
sufficiently  small. 

We  also  carried  out  all  three  experiments  with  sample  sizes  of  800  and  8000.    The 
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Confidence  Intervals 


Relative  Relative 

p(xl02)       p.p.e.  (xlO^)         p.b.e.  (xlO^)       Efficiency     Cost  Efficiency 


0.1 

0.085  ±  0.0030 

0.110  ±  0.0026 

0.911 

0.798 

0.5 

0.465  ±  0.0060 

0.500  ±  0.0053 

1.007 

0.847 

1 

0.949  ±  0.0090 

0.944  ±  0.0075 

1.028 

0.794 

5 

4.832  ±  0.0220 

4.638  ±  0.0214 

3.774 

1.783 

10 

9.713  ±  0.0320 

9.468  ±  0.0267 

4.651 

1.866 

15 

14.592  ±  0.0390 

14.400  ±  0.0339 

7.575 

2.770 

20 

19.429  ±  0.0470 

19.260  ±  0.0432 

9.615 

3.460 

Table  6:  Comparison  of  the  p.p.e.  and  p.b.e. 

resulting  point  estimates  and  efficiency  ratios  under  the  smaller  sample  sizes  were  nearly 
identical  to  the  corresponding  quantities  in  Tables  3-6;  however,  as  expected,  the  sample 
size  of  80,000  gave  more  precise  estimates  for  the  various  quantities.  It  is  worth  noting  that 
the  bias  of  the  proposed  estimator  is  less  than  0.006  for  all  seven  scenarios  and  all  three 
sample  sizes.  Because  such  a  small  error  is  unlikely  to  significantly  affect  any  public  health 
policy,  the  proposed  estimator  appears  to  be  sufficiently  accurate  for  policy  making  purposes. 
In  summary,  our  results  demonstrate  that  in  order  to  obtain  the  most  precise  prevalence 
estimates  it  is  necessary  to  adopt  a  pooled  testing  strategy  that  exploits  the  dilution  effect. 
As  a  rough  guideline,  the  proposed  binary  estimator  should  be  adopted  when  the  prevalence 
is  low,  and  the  p.p.e.  should  be  adopted  when  the  prevalence  is  higher,  where  the  cutoff  is 
somewhere  between  1%  and  5%. 

7.5     Application 

To  better  estimate  the  benefits  that  are  achievable  from  our  proposed  procedure,  we  consider 
the  following  hypothetical  application:  it  is  known  that  the  prevalence  is  close  to  5%,  and 
three  options  are  under  consideration,  the  individual  estimator,  the  proposed  binary  esti- 
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mator  and  the  proposed  paxametric  estimator.  A  budget  constraint  is  imposed  such  that, 
if  individual  testing  is  adopted,  at  most  1000  samples  can  be  tested.  The  objective  is  to 
employ  the  procedure  that  will  give  the  smallest  mse. 

Individual  estimator:  Here  n  =  1000  and  m  =  1.  From  the  results  of  experiment  3 
(not  shown),  the  variance  is  expected  to  be  6.2  x  10""*.  Because  the  individual  estimator  is 
unbiased,  this  coincides  with  the  mse. 

Proposed  binary  estimator:  From  Table  2,  m  =  20,  and  to  satisfy  the  budget 
constraint  n  =  ^^^I'^^^^o)  ~  ^65.  From  Table  6,  the  variance  is  (/lOO^f§P)^  »^  = 
1.00  X  10-^  and  the  mse  is  (0.04636  -  0.05)2  ^  ;^  qO  x  10"^  =  2.30  x  10"^ 

Proposed  parametric  estimator:  Here  m  =  80,  and  to  satisfy  the  budget  constraint 
"  =  i.35+aM(80)  ^  220.  From  Table  6,  the  variance  is  expected  to  be  {VTOO^^^f  5«^  = 
5.72  x  10-^  and  the  mse  is  (0.04832  -  O.OS)^  +  5.72  x  10"^  =  8.54  x  10"^ 

Hence,  the  proposed  parametric  estimator  has  the  smallest  mse  and  is  the  preferred 
procedm-e. 

8     Concluding  Remarks 

We  have  developed  a  parametric  procedure  to  estimate  the  prevalence  of  HIV  from  pooled 
samples.  Our  approach  is  novel  in  that  it  captures  the  dilution  effect  and  estimates  the 
prevalence  directly  from  the  continuous  OD  readings.  This  procedmre  was  developed  specif- 
ically for  the  HIV  estimation  problem  but  is  applicable  whenever  liquids  or  gases  are  pooled 
together  and  tested  for  estimation  purposes;  many  apphcations  can  be  found  in  industrial 
or  environmental  quality  control. 

The  procedure  was  tested  on  simulated  data  that  were  generated  from  actual  blood 
samples,  and  the  results  indicate  that  it  is  more  accurate  and  roughly  an  order-of-magnitude 
more  efficient  than  existing  binary  pooling  procedures.  As  expected,  the  benefits  from  this 
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procedure  axe  larger  when  the  prevalence  is  high.  We  also  derived  a  new  binary  procedure 
that  exphcitly  accounts  for  the  dilution  effect.  When  the  prevalence  is  above  5%  our  proposed 
parametric  estimator  was  several  times  more  efficient  than  the  proposed  binary  estimator. 
However,  the  approximations  embedded  in  our  procedures  appear  to  more  accurate  in  the 
binary  setting  than  in  the  continuous  OD  setting,  and  when  the  prevalence  is  below  1%  the 

proposed  binary  procedure  performs  better  than  proposed  parametric  procedure. 

In  conclusion,  the  numerical  results  provide  strong  evidence  supporting  the  adoption 

of  our  parametric  pooled  testing  procedures  for  population  surveys  aimed  at  estimating  HIV 

prevalence.  The  next  step  is  to  determine  whether  the  conclusions  of  this  paper  are  confirmed 

when  the  procedure  is  tested  on  real  data. 

Acknowledgement 

This  research  is  supported  by  National  Science  Foundation  grant  DDM-9057297  and  Ameri- 
can Foundations  for  AIDS  Research  (AmFAR)  grant  02100-15-RG.  The  authors  are  grateful 
to  Elizabeth  Deix  for  providing  the  data,  and  John  Karon  and  Glenn  Satten  for  helpful 
discussions. 

Appendix 
A     Convergence  Proof 

In  this  appendix,  we  show  that  the  estimation  algorithm  in  Section  3  is  an  Expectation- 
Maximization  (EM)  algorithm.  Let  us  define  the  unobserved  vector  z  =  (zi, . . . ,  2„),  where 
Zi  is  the  number  of  infected  individuals  in  pool  i,  and  view  the  raw  data  x  as  the  incomplete 
observations  from  the  complete  data  set  (x,  z).  Hence,  the  complete  log-likelihood  is  given 
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by 


1=1  fc=0 


(47) 


The  expectation  step  of  the  EM  algorithm  calculates  E  /c(x,  z;p)|x,p''''  .  Because 

E[l{z,  =  k)\x;p^'^]=Tk{x,;p^'^),  (48) 

it  follows  from  equation  (47)  that 

n      m 

E  [/e(x,  z;p)|x,p(^)]  =  ^  ^  rfc(x.;p(^)) 


t=l  *:=0 


log  r  p'(i-pr-N+iog(/r^(a:.)) 


The  maximization  step  updates  p^'^  using 


p(^+i)  =  arg  maXpE"  [/c(x,z;p)|x,p' 


is) 


(49) 


(50) 


The  function  E 


/c(x,  z;p)|x,p^*M  is  concave  in  p,  and  hence  p(*+i^  is  the  unique  solution  to 


the  first-order  optimality  condition 


dE 


/c(x,z;p)|x,p'^^ 


dp 


=  0, 


(51) 


which,  after  some  algebra,  gives 


nm 


(52) 


1=1  fc=0 

The  equivalence  of  (52)  and  (11)  estabhshes  that  the  iterative  procedure  is  the  EM  algorithm. 
To  prove  that  the  algorithm  converges  to  a  fixed  point,  we  need  the  following  fact. 

Theorem  2  (Dempster  et  al.):   If  the  incomplete  log-likelihood  is  bounded  from 
above  and 


/e(x,z;p(^+i))|x,p(^)]  -  E  [/,(x,z;p(^))|x,p(^^]  >  X{p^'^'^  -  p^'^f 

for  some  scalar  A  and  all  p'-^^ ,  then  the  sequence  p^^^  converges  to  some  p*  in  [0, 1] . 
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(53) 


To  show  that  L(x;p)  is  bounded  from  above  in  our  problem,  we  define 

M  =  sup{max(7r+(?/),  7r_(?/))}. 
y 

Then  7r*<''''">(T/)  <  M'"  and  from  (1),  7r?'''"^(y)  <  mM"^.  Thus,  it  follows  from  (6)  that 

L(x;p)<[mM"']".  (54) 

To  estabhsh  (53),  we  use  (49)  to  obtain 

^[/,(x,z;p(^+i))|x,pW]  -^[Ze(x,z;p(^')|x,pW;  = 

n      m 

X:E^^(^«;P^'^)  (A:logp<^^^)  +  (m  -  A;)log(l  -p(^+^)))  - 

t=l  fc=0 
n      m 

^  ^r,(x.;p(^))  (fclogp^  +  (m  -  A;)  log(l  -  p^)) .  .  (55) 

i=l  fc=0 

Since  X!r=i  I^fcLo  ^'^fc(^i;p'^^)  =  nrnp^^"*"^',  the  right  side  of  (55)  reduces  to 


nm 


['''""'°8(3  +  ('-^'"""°«(t^))-  (^^' 


To  complete  the  convergence  proof,  we  need  the  following  lemma. 
Lemma  1  For  all  {x,y)  G  [0, 1]^, 


xlog  (£j  +  (1  -  x)log  (^-^]  >{x-  yf.  (57) 


Proof:   Fix  2/  e  (0, 1)  and  define  h{x)  =  xlog  (^)  +  (1  -  a:)log  [^)  -  (x  -  yf.   Setting 
/i'(x)  =  0  gives 

which  is  satisfied  hy  x  =  y.  Also,  h"{x)  =  ^r^  -  2  >  0.  Therefore,  h{x)  is  convex  and  is 
minimized  when  x  —  y.  Because  h{y)  =  0,  it  follows  that 

X  log  [  -  j  +  (1  -  x)  log  (^^)  >{x-  y?   for  all  x  e  (0, 1).  (59) 
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Because  y  is  arbitrary,  relation  (59)  holds  for  every  x  and  j/  in  (0, 1).  The  result  extends 
to  the  closed  interval  [0, 1]  by  continuity.  ■ 

Lemma  1  and  (56)  imply  that 
E  [/,(x,z;p(^+i))|x,p(^']  -  E  [/c(x,z;p('))|x,p(^^]  >  nm{p^'^  -  p'^'^^^f   for  all  p* .       (60) 
Therefore,  the  conditions  of  Theorem  2  hold  and  the  algorithm  converges  to  a  fixed  point 

B     Partial  Derivative  ^[x\y) 

Define  the  constants 

A  =  47(x-l)2, 

B  =  [47(x-l-0)  +  (l-7)0](x-l), 

C  =  (40 -87x)(x-l)  + 02(^-2), 

D  =  3;(47x  -  40(1  +  27)), 

E  =  47x^ 

F  =  -87x2  +  (87 -40)x  + 02(^^2). 


Then 


dy^  40^y''(l +7/T)  \  yT         y-^T         / 
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